(57) Abstract

The present invention is a computer apparatus and method for adding speech interpreting capabilities to an interactive voice response
system. An annotated corpus is used to list valid utterances within a grammar along with token data for each valid utterance representing
the meaning behind the valid utterance. When valid utterances are detected, the interactive voice response system requests that a search
is made through the annotated corpus to find the token identified with the valid utterance. This token is returned to the interactive voice
response system. If the valid utterance included a variable, additional processing is performed to interpret the variable and return additional
data representing the variable.

SYSTEM AND METHOD USING NATURAL LANGUAGE UNDERSTANDING FOR SPEECH CONTROL APPLICATION



Field of the Invention

This invention relates generally to
computerized natural language systems. More
particularly, it relates to a computer system and method
for providing speech understanding capabilities to an
interactive voice response system. It further relates to
a computer system and method for interpreting spoken
utterances in a constrained speech recognition
application.

Description of the Related Art

Computers have become a mainstay in our
everyday lives. Many of us spend hours a day using the
machines at work, home and even while shopping. Using a
computer, however, has always been on the machine's
terms. A mouse, pushbuttons and keyboards have always
been somewhat of an unnatural way to tell the computers
what we want. However, as computer technology continues
to advance, the computer is edging towards communicating
with humans on our terms: the spoken word.

There are essentially two steps in creating a
computer that can speak with humans. First, the computer
needs automatic speech recognition to detect the
spoken words and convert them into some form of
computer-readable data, such as simple text. Second, the computer
needs some way to analyze the computer-readable data and
determine what those words, as they were used, meant.
This second step typically employs some form of
artificial intelligence, and there are several basic
approaches researchers have taken to develop a system
that can extract meaning from words.

One such approach involves statistical
computational linguistics. This approach relies on the
relatively predictable nature of human speech.
Statistical computational linguistics begins with a
corpus, which is a list of sample utterances contained in
the grammar. This corpus is analyzed and statistical
properties of the grammar are extracted. These
statistical properties are implemented in rules, which
are then applied to new, spoken utterances in an attempt
to statistically "guess" the meaning of what was said.

Because of the large number of possible
utterances in any language (English, Chinese, German,
etc.), no corpus-based language system attempts to list
the full set of valid utterances in that language. Some
systems, however, have attempted to reduce the number of
possible utterances by constraining, or restricting, the
valid ones to those in a predefined grammar. For
example, U.S. Patent (application 08/066,747, allowed
12/13/96), issued to Linebarger, et al., assigned to
Unisys Corporation, Blue Bell, Pa. and incorporated
herein by reference, teaches a language processor that
only understands air traffic control instructions.
There, the air traffic controller's sentence was
segmented into individual instructions, which were then
individually processed to determine their meaning.
Unfortunately, this type of processing can quickly
consume much computing power when the valid grammar is
increased from the relatively limited vocabulary of air
traffic controls to, for example, a bank automated teller
machine that can handle all sorts of transactions.



Other natural language systems may allow for a
full range of utterances, but this high degree of
generality also requires much computing power. What is
needed is a language understanding system that can
interpret speech in a constrained grammar that does not
require the full generality of a natural language system.



Summary of the Invention

A general purpose of the present invention is
to provide a system and method for providing constrained
speech understanding capabilities to an interactive voice
response system.



Another object of the present invention is to
provide a system and method for simplifying the task of
interpreting the meaning behind a spoken utterance.



A further object of the present invention is to
provide a system and method for creating a corpus-based
speech recognition system that is highly accurate in its
interpretation of the meaning behind a spoken utterance.



A still further object of the present invention
is to provide a system and method for employing a
plurality of runtime interpreters that are connected to
the interactive voice response system by a computer
network.

These and other objects are accomplished by the
present invention which provides a runtime interpreter
for determining the meaning behind a spoken utterance
with a simple search of a list. The runtime interpreter
receives, as input, an annotated corpus which is a list
of valid utterances, context identifiers for each valid
utterance, and token data for each valid utterance
representing the meaning behind the utterance. The
runtime interpreter also receives, as input, an utterance
in text form which is to be found in the corpus.

When the runtime interpreter is given an
utterance to interpret, the runtime interpreter searches
through the corpus, locates the valid utterance being
searched for, and returns the token which represents the
meaning of the valid utterance.

The runtime interpreter also supports the use
of variables to reduce the size of the corpus. Some
utterances may include numbers, dates, times or other
elements that have too many combinations to enumerate in
the corpus. For example, the utterance "My birthday is
xxx", where 'xxx' is the day of the year, could result in
366 corpus entries, one for each possible day of the year
(including leap day). In the present invention, however,
a variable would be used to represent the date. Thus, a
reduced corpus would include just one entry for this
utterance: "My birthday is [DATE]". The runtime
interpreter is able to identify these variables in the
corpus, and performs additional processing during runtime
to interpret the variables. The variable values, once
interpreted, are then stored in a predefined data
structure associated with the token whose utterance
included the variable. This variable value can then be
retrieved by the interactive voice response system.

The present invention also provides a custom
processor interface which gives the developer of the
interactive voice response system the ability to
customize the operation of the runtime interpreter,
without actually modifying the interpreter itself.

Furthermore, the present invention provides
a system and method for using a plurality of interpreters
that are connected to a computer network. Distributed
interpreters are provided which include the same custom
processor interface and runtime interpreter mentioned
above. The distributed interpreters, however, include an
additional manager for controlling messaging between the
distributed interpreter and the computer network. A
resource manager is also provided, which keeps track of
the distributed interpreters that are connected to the
network and manages their use by an interactive voice
response system.

Brief Description of the Drawings

Figure 1 depicts an overview of an embedded
natural language understanding system.

Figure 2 is a table showing the variable types
supported in the preferred embodiment.

Figure 3 depicts sample formats for the
annotated ASR corpus files and vendor-specific ASR
grammar file.

Figure 4 is a flow diagram depicting the
operation of the IVR as it accesses the runtime
interpreter.

Figure 5 depicts the distributed system
architecture.

Description of the Preferred Embodiment

Before describing the present invention,
several terms need to be defined. These terms, and their
definitions, include:

annotated ASR corpus file - data file
containing a listing of valid utterances in a grammar, as
well as token data for each valid utterance which
represents the meaning of the valid utterance to the
interactive voice response system (IVR 130).

automatic speech recognition (ASR) - generic
term for computer hardware and software that are capable
of identifying spoken words and reporting them in a
computer-readable format, such as text (characters).
cells - discrete elements within the table (the
table is made up of rows and columns of cells). In the
example rule given with the definition of 'rules' below,
each of "I want", "I need" and "food" would be placed in
a cell. Furthermore, in the preferred embodiment, the
cells containing "I want" and "I need" are vertically
adjacent one another (same column). Vertically
adjacent cells are generally OR'd together. The cell
containing "food", however, would occur in the column to
the right of the "I want" and "I need" column, indicating
the fact that "food" must follow either "I want" or "I
need" and as such, the cell containing "food" will be
AND'd to follow the cells containing "I want" and "I
need".

constrained grammar - a grammar that does not
include each and every possible statement in the
speaker's language; limits the range of acceptable
statements.

corpus - a large list.

grammar - the entire language that is to be
understood. Grammars can be expressed using a set of
rules, or by listing each and every statement that is
allowed within the grammar.

grammar development toolkit (104) - software
used to create a grammar and the set of rules
representing the grammar.

natural language understanding - identifying
the meaning behind spoken statements that are spoken in a
normal manner.

phrase - the "building blocks" of the grammar,
a phrase is a word, group of words, or variable that
occupies an entire cell within the table.

rules - these define the logic of the grammar.
An example rule is: ("I want" | "I need")("food"), which
defines a grammar that consists solely of statements that
begin with "I want" OR "I need", AND are immediately
followed with "food".

runtime interpreter (124) - software that
searches through the annotated corpus (122) whenever a
valid utterance is heard, and returns a token
representing the meaning of the valid utterance.

runtime interpreter application program
interface (RIAPI) - set of software functions that serve
as the interface through which the interactive voice
response system (130) uses the runtime interpreter.

speech recognizer (116) - combination of
hardware and software that is capable of detecting and
identifying spoken words.

speech recognizer compiler (114) - software
included with a speech recognizer (116) that accepts, as
input, a vendor-specific ASR grammar file (112) and
processes the file (112) for use in a speech recognizer
(116) during runtime.
table - two dimensional grid used to represent
a grammar. Contents of a table are read, in the
preferred embodiment, from left to right.

token - each valid utterance in the table is
followed by a cell that contains a token, where the token
is a unique data value (created by the developer when
s/he develops the grammar) that will represent the
meaning of that valid utterance to the interactive voice
response system (130).

utterance - a statement.

utterance, spoken - an utterance that was said
aloud. The spoken utterance might also be a valid
utterance, if the spoken utterance follows the rules of
the grammar.

utterance, valid - an utterance that is found
within the grammar. A valid utterance follows the rules
which define the grammar.

variable - "place holder" used in the corpus
(122) to represent a phrase which has too many
possibilities to fully enumerate. For example, the
utterance "My favorite number between one and a million
is xxx" could result in 999,998 corpus entries, one for
each possible number. In the present invention, however,
a variable would be used to represent the number in the
corpus (122). Thus, a reduced corpus (122) would include
just one entry for this utterance: "My favorite number
between one and a million is [INTEGER]". The runtime
interpreter (124) is able to identify this variable in
the corpus, and performs additional processing during
runtime to interpret the number.

vendor-specific ASR grammar file (112) - a data
file that contains the set of rules representing a
grammar, and is written in a format that will be
recognized by the speech recognizer compiler (114).

Referring now to the drawings, where elements
that appear in several drawings are given the same
element number throughout the drawings, the structures
necessary to implement a preferred embodiment of an
embedded natural language understanding system (100) are
shown in figure 1. The basic elements comprise:

an interactive voice response system (130), or
IVR;

the grammar development toolkit (104);

a compiler (114) and speech recognizer (116)
that are part of an automatic speech recognition (ASR)
system (118);

an annotated automatic speech recognition (ASR)
corpus file (122);

a vendor-specific ASR grammar file (112);

the runtime interpreter (124);

the custom processor interface (126), or CP;
and

the runtime interpreter application program
interface (128), or RIAPI.

These elements will be
discussed in detail further below, but an initial
overview of the embedded architecture will be helpful to
a full understanding of the elements and their roles.

1. Overview of Embedded Architecture

The following overview discusses the embedded
architecture, which employs a single runtime interpreter
(124). There is a second, distributed, architecture
which employs a plurality of runtime interpreters. The
distributed architecture will be discussed further below.

The first step in implementing a natural
language system is creating the set of rules that govern
the valid utterances in the grammar. As an example, a
grammar for the reply to the question: "what do you want
for lunch?" might be represented as:

<reply>: (("I want" | "I'd like") ("hot dogs" | "hamburgers"));

Under this set of rules, all valid replies consist of
two parts: 1) either "I want" or "I'd like", followed by
2) either "hot dogs" or "hamburgers". This notation is
referred to as Backus-Naur Form (BNF), a form of grammar
that uses logical ANDs and ORs. The preferred embodiment
of the present invention generates this type of grammar.
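As a rough illustration of how a rule of this kind expands into its list of valid utterances, the following Python sketch treats the rule as a sequence of AND'd slots, each holding OR'd alternatives. The names REPLY_RULE and expand_rule are hypothetical and are not part of the preferred embodiment.

```python
from itertools import product

# Hypothetical in-memory form of the <reply> rule: a sequence of slots that
# are AND'd together, where each slot lists alternatives that are OR'd.
REPLY_RULE = [
    ["I want", "I'd like"],
    ["hot dogs", "hamburgers"],
]


def expand_rule(rule):
    """Return every utterance the rule allows, reading slots left to right."""
    return [" ".join(choice) for choice in product(*rule)]


if __name__ == "__main__":
    for utterance in expand_rule(REPLY_RULE):
        print(utterance)
```

Two slots of two alternatives each give four valid replies; a listing of this kind is what a corpus of valid utterances would contain.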

Referring to Figure 1, the grammar is generated
by a developer using the grammar development toolkit
(104). In the preferred embodiment, the toolkit (104) is
developed using a computer that has an Intel-based
central processing unit (CPU 102) (such as the Intel
Pentium II) with Microsoft Visual Basic as the software
development program. The computer also contains random
access memory (RAM 106), memory files (108) stored in
system memory, and keyboard (110).

The toolkit (104) is a novel spreadsheet-oriented
software package that provides the developer of
a natural language application with a simplified way of
generating a grammar.

When the developer has completed the grammar
using the toolkit (104), two outputs are generated by the
toolkit (104) for use in the natural language system.
The first such output is a vendor-specific ASR grammar
file (112), which is saved in a format that will be
recognizable by the automatic speech recognition system,
or ASR (118). The ASR system (118) includes two parts, a
compiler (114) and the actual speech recognizer (116).
In the preferred embodiment, speech recognizer (116) is a
continuous speech, speaker independent speech recognizer.
Commercially available speech recognizers (116) include
the ASR-1500, manufactured by Lernout & Hauspie; Watson
2.0, manufactured by AT&T; and Nuance 5.0, by Nuance.
The preferred embodiment of the toolkit (104) is able to
generate grammar files for any of these recognizers.

The vendor-specific ASR grammar file (112)
contains information regarding the words and phrases that
the speech recognizer (116) will be required to
recognize, written in a form that is compatible with the
recognizer. The file is also optimized to take advantage
of peculiarities relating to the chosen speech recognizer
(116). For example, experience with the L&H recognizers
has shown that L&H grammars perform well if the grammar
avoids having multiple rules with the same beginning
(for example, three rules starting with "I want").
Optimization of a grammar for an L&H recognizer would
rewrite a set of rules from <rule1>: (ab)|(ac)|(ad), to
<rule2>: a(b|c|d). Here the three rules of 'rule1' have
been rewritten and combined into the one rule of 'rule2'.
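The rewrite can be pictured as factoring out a shared first phrase. The Python sketch below only illustrates that idea, assuming each rule alternative is held as a list of phrases; factor_common_prefix and render are hypothetical helpers, not the toolkit's actual optimizer.

```python
def factor_common_prefix(alternatives):
    """Group alternatives by their first phrase.

    Each alternative is a list of phrases, e.g. ["a", "b"] for the rule (ab).
    Returns (first_phrase, list_of_remainders) pairs in first-seen order.
    """
    groups = {}
    for alt in alternatives:
        head, tail = alt[0], alt[1:]
        groups.setdefault(head, []).append(tail)
    return list(groups.items())


def render(groups):
    """Render the factored groups in the (ab)|(ac) style notation."""
    parts = []
    for head, tails in groups:
        if len(tails) == 1:
            # Nothing to factor: keep the alternative as a single rule.
            parts.append("(" + head + "".join(tails[0]) + ")")
        else:
            # Shared beginning: emit it once, followed by the OR'd endings.
            parts.append(head + "(" + "|".join("".join(t) for t in tails) + ")")
    return "|".join(parts)


if __name__ == "__main__":
    rule1 = [["a", "b"], ["a", "c"], ["a", "d"]]
    print(render(factor_common_prefix(rule1)))
```

Alternatives that share no beginning are left unchanged, so the sketch only touches the case the text describes.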

In order to operate and recognize speech, the
speech recognizer will need to compile the
vendor-specific ASR grammar file (112) using a compiler tool
(114) supplied by the ASR system (118) vendor. The
preferred embodiment of the toolkit (104) knows, when the
grammar is first generated, which speech recognizer (116)
will be used and is able to format the vendor-specific
ASR grammar file (112) accordingly.

The second output from the toolkit (104) is an
annotated ASR corpus (122), which is actually a pair of
flat files. A sample format for the files is shown in
figure 3. The first of the pair is a corpus file, and
contains a listing of all possible logical sentences or
phrases in the grammar (with the exception of variables,
discussed below), the compartments (groups of tables) in
which they appear, and a value representing the class of
the utterance (sentence) heard. The second is an answers
file that maps each utterance class with a token, or data
value that represents the meaning of the utterance.
These two files will be used by the runtime interpreter
(124).
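The relationship between the two files can be sketched in Python as follows. This is only an illustration: the actual file layout is the one shown in figure 3, and the field order, the compartment name, the class numbers and the token strings used here are all assumed for the example.

```python
# Hypothetical in-memory stand-ins for the two flat files.
CORPUS = [
    # (utterance, compartment, utterance class)
    ("i want hot dogs", "lunch", 1),
    ("i'd like hot dogs", "lunch", 1),
    ("i want hamburgers", "lunch", 2),
    ("i'd like hamburgers", "lunch", 2),
]
ANSWERS = {
    # utterance class -> token
    1: "HOTDOG",
    2: "HAMBURGER",
}


def find_token(utterance, compartment):
    """Return the token for a valid utterance, or None if it is not listed."""
    for text, comp, utterance_class in CORPUS:
        if text == utterance and comp == compartment:
            return ANSWERS[utterance_class]
    return None


if __name__ == "__main__":
    print(find_token("i'd like hamburgers", "lunch"))
```

Several utterances can share one class, so differently worded requests with the same meaning map to the same token.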

During runtime, a speaker speaks into the
microphone (or telephone) (120) attached to the speech
recognizer (116). The recognizer (116) identifies the
words and phrases it hears and notifies the IVR (130)
when a valid utterance has been heard. The IVR (130) is
the system which needs the speech understanding
capabilities, and includes the necessary external
connections and hardware to function (for example, a
banking IVR (130) might include a connection to the bank
database, a keypad for entering data, a visual display
for displaying information, a dispenser for dispensing
money, and a speaker for speaking back to the user).
This valid utterance is passed, in a computer-readable
form such as text, to the IVR (130) which then notifies
the runtime interpreter (124) of the utterance that was
heard. The runtime interpreter (124) consults the
annotated ASR corpus (122) and returns an appropriate
token to the IVR (130) for the valid sentence heard by
the recognizer (116). This token represents the meaning
of the utterance that was heard by the recognizer (116),
and the IVR (130) is then able to properly respond to the
utterance. The CP (126) and RIAPI (128) serve as
software interfaces through which the IVR (130) may
access the runtime interpreter (124). It is the IVR
(130) that ultimately uses the speech capabilities to
interact with the speaker during runtime.

3. The Runtime Interpreter 

The runtime interpreter (124) is a software
component that receives, in text form, a valid spoken
utterance that was heard and context information
identifying the compartment(s) to be searched. The
runtime interpreter (124) then performs a search through
the corpus file (122) (which has been loaded into RAM for
faster searching) to find the valid utterance. Once a
valid utterance is found in the corpus, the associated
token is stored in memory to be retrieved by the IVR
(130). In an embedded application, calls made to the
runtime interpreter (124) are made by functions within
the Custom Processor (126), or CP. The CP (126) is
another software component that is originally a
transparent "middleman" between the runtime interpreter
(124) and the RIAPI (128). The IVR (130), created by the
developer, only accesses the functions within the RIAPI
(128). The RIAPI (128) will make the necessary CP (126)
calls which, in turn, make the necessary runtime
interpreter (124) calls.

The purpose for having the CP (126) lies in
customizability. The CP (126) can be customized by the
developer to enhance the processing of utterances. For
example, the developer may wish to perform some type of
processing on the utterance before it is actually
processed by the runtime interpreter (124). This
pre-processing can be added, by the developer, to the CP
(126) without actually modifying the runtime interpreter
(124). Use of the CP (126) is particularly convenient
when the underlying IVR (130) is done in a low level
scripting language, such as Vos (by Parity) or BlaBla (by
MediaSoft), that does not directly support the
pre-processing of the utterance text. If the IVR (130) is
written in a higher level language, such as C++, then
pre-processing of the utterance text can be done in the
IVR (130) code itself, without need for the CP (126).
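The layering described above, in which the IVR calls the RIAPI, the RIAPI calls the CP, and the CP calls the runtime interpreter, can be sketched as below. Every function name, the tiny corpus and the filler-word filter are hypothetical stand-ins chosen for the example; they are not the function names of the preferred embodiment.

```python
def runtime_interpret(utterance):
    """Stand-in for the runtime interpreter: look up a tiny assumed corpus."""
    corpus = {"i want hot dogs": "HOTDOG", "i want hamburgers": "HAMBURGER"}
    return corpus.get(utterance)


def cp_interpret(utterance, preprocess=None):
    """Stand-in for a CP function.

    With no preprocess hook the CP is a transparent middleman; a developer
    can supply a hook without touching runtime_interpret.
    """
    if preprocess is not None:
        utterance = preprocess(utterance)
    return runtime_interpret(utterance)


def riapi_interpret(utterance, preprocess=None):
    """Stand-in for a RIAPI function: the only entry point the IVR calls."""
    return cp_interpret(utterance, preprocess)


def strip_fillers(utterance):
    """Example developer-supplied hook: lower-case and drop filler words."""
    fillers = {"um", "uh", "please"}
    return " ".join(w for w in utterance.lower().split() if w not in fillers)


if __name__ == "__main__":
    print(riapi_interpret("Um I want hot dogs please", strip_fillers))
```

Passing the hook as an argument is just one way to model customization; the point is that the interpreter function itself stays unmodified.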

The runtime interpreter (124) also provides
functionality to extract variables from utterances. When
the corpus file is first loaded, corpus items that
contain variables are flagged. If an initial binary
search through the corpus fails to find the exact
utterance, a second search is performed to find a partial
match of the utterance. This time, only flagged corpus
items are searched, and a partial match is found if the
utterance contains at least the non-variable portions of
a corpus item.
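The loading and first-pass search can be sketched in Python with the standard bisect module. This is a minimal sketch under assumptions: the corpus is modeled as (utterance, token) pairs, an item counts as containing a variable when its text includes an opening square bracket, and the names load_corpus and exact_search are hypothetical.

```python
from bisect import bisect_left


def load_corpus(entries):
    """Alphabetize (utterance, token) pairs and flag items holding a variable."""
    items = sorted(entries)
    keys = [text for text, _ in items]
    flagged = [i for i, text in enumerate(keys) if "[" in text]
    return keys, items, flagged


def exact_search(keys, items, utterance):
    """Binary search for an exact match; return its token, or None if absent."""
    pos = bisect_left(keys, utterance)
    if pos < len(keys) and keys[pos] == utterance:
        return items[pos][1]
    return None


ENTRIES = [
    ("what is my balance", "BALANCE"),
    ("i want to transfer [CURRENCY1, money] to savings", "TRANSFER"),
    ("goodbye", "EXIT"),
]
KEYS, ITEMS, FLAGGED = load_corpus(ENTRIES)

if __name__ == "__main__":
    print(exact_search(KEYS, ITEMS, "goodbye"))
    # A miss here is what would trigger the second search over FLAGGED items.
    print(exact_search(KEYS, ITEMS, "i want to transfer ten dollars to savings"))
```

Keeping the indices of flagged items means the second search only has to visit the entries that contain a variable.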

For example, the preferred embodiment corpus
file (122) format uses square brackets ('[' and ']') to
set off variables from normal words in the valid
utterances. Thus, the utterance "I want to transfer
[CURRENCY1, money] to savings" might be found in the
corpus file. If the spoken utterance heard by the
recognizer (116) is "I want to transfer ten dollars to
savings", an initial binary search would probably fail to
match the spoken utterance with any of the corpus items.
If this initial search fails, the interpreter (124) then
performs a second search of all flagged corpus items.
The spoken utterance that was heard at least contained "I
want to transfer ... to savings", and a partial match
would be made. The unmatched words, "ten dollars", would
then be processed by another algorithm as a variable of
type [CURRENCY1, money] which would convert the phrase
"ten dollars" to 10.00 and return 10.00 as the variable
associated with the token "money". This variable data is
then stored in a predefined data structure that is
associated with the location in memory where the token
was stored. When the IVR (130) processes the token, it
knows that variable data was also returned and retrieves
the variable data from memory.
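One way to picture the partial match is to turn the corpus item into a pattern whose fixed words must appear and whose bracketed variable captures whatever words remain. The Python sketch below uses the standard re module; it is an assumed reading of the matching step, stricter than "contains at least the non-variable portions" because it requires the whole utterance to fit the template, and partial_match is a hypothetical name.

```python
import re

# Matches one bracketed variable such as [CURRENCY1, money].
VARIABLE = re.compile(r"\[([^\]]+)\]")


def partial_match(template, utterance):
    """Match an utterance against a corpus item that contains variables.

    Returns a dict mapping each variable's bracketed text to the words that
    filled it, or None when the fixed words do not line up.
    """
    names = VARIABLE.findall(template)
    # Splitting on a pattern with a capture group keeps the captured names,
    # so even positions hold fixed text and odd positions hold variables.
    pieces = VARIABLE.split(template)
    pattern = ""
    for i, piece in enumerate(pieces):
        pattern += "(.+?)" if i % 2 else re.escape(piece)
    found = re.fullmatch(pattern, utterance)
    if found is None:
        return None
    return dict(zip(names, found.groups()))


if __name__ == "__main__":
    item = "i want to transfer [CURRENCY1, money] to savings"
    print(partial_match(item, "i want to transfer ten dollars to savings"))
```

The captured words would then be handed to the conversion routine for the variable's type.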

The algorithm for converting variables in
utterances to variable data depends on the type of data
contained within the variable. Figure 2 shows variable
types that are supported by the preferred embodiment.
The following pseudocode illustrates the steps used in
the preferred embodiment to convert the variable portion
of the utterance (in text form) to variable data (in
number form).

INTEGER1: ("One hundred thousand and ten")
Set TEMP result buffer = 0;
Separate variable portion of utterance into individual
words (i.e. "one" "hundred" "thousand" "and" "ten") based
on blank spaces between words.
FOR each individual word (reading left to right):
  if (individual word = "one"), increase TEMP by 1;
  if (individual word = "two"), increase TEMP by 2;
  . . .
  if (individual word = "twenty"), increase TEMP by 20;
  if (individual word = "thirty"), increase TEMP by 30;
  . . .
  if (individual word = "ninety"), increase TEMP by 90;
  if (individual word = "hundred")
    if (TEMP > 1000) ("four thousand five hundred", when the
      word "hundred" is reached, will have been handled as
      "four thousand five", and TEMP would be 4005. This
      needs to be changed to 45, before multiplying by 100)
      TEMP = (TEMP/100) + least significant digit of TEMP;
    end if;
    multiply TEMP by 100;
  end if;
  if (individual word = "thousand"), multiply TEMP by 1000;
  if (individual word = "and"), ignore;
END FOR loop;
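The INTEGER1 steps can be sketched in Python as follows. The pseudocode elides most of the number words, so the full WORD_VALUES table is filled in here as an assumption, and the division by 100 is taken to be integer division; integer1 is a hypothetical name.

```python
# Number words; the entries the pseudocode elides are an assumption.
WORD_VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}


def integer1(phrase):
    """Convert a spoken integer such as "four thousand five hundred" to an int."""
    temp = 0
    for word in phrase.lower().split():
        if word in WORD_VALUES:
            temp += WORD_VALUES[word]
        elif word == "hundred":
            if temp > 1000:
                # e.g. 4005 ("four thousand five") becomes 45 before scaling
                temp = temp // 100 + temp % 10
            temp *= 100
        elif word == "thousand":
            temp *= 1000
        # "and" and any unrecognized word are ignored
    return temp


if __name__ == "__main__":
    print(integer1("One hundred thousand and ten"))
    print(integer1("four thousand five hundred"))
```

The sketch follows the pseudocode literally, so it inherits the same limits: words such as "million" are not handled.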

WO 99/14743    PCT/US98/19433

INTEGER2: ("One Two Three Four")
As in INTEGER1, break up variable utterance into individual
words, and set a TEMP buffer = 0;
FOR each individual word (reading left to right):
  multiply TEMP by 10;
  if (individual word = "one"), increase TEMP by 1;
  . . .
  if (individual word = "nine"), increase TEMP by 9;
END FOR;
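A Python sketch of the INTEGER2 steps is given below. The pseudocode only shows "one" and "nine"; the remaining digit words, and the treatment of "zero" and "oh" as 0, are assumptions made for the example, and integer2 is a hypothetical name.

```python
# Digit words; "zero" and "oh" are assumed additions to the elided list.
DIGITS = {
    "zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def integer2(phrase):
    """Convert a digit-by-digit utterance such as "one two three four"."""
    temp = 0
    for word in phrase.lower().split():
        # Shift the digits collected so far one place to the left, then add
        # the new digit. As in the pseudocode, a word that is not a digit
        # still causes the shift and adds nothing.
        temp = temp * 10 + DIGITS.get(word, 0)
    return temp


if __name__ == "__main__":
    print(integer2("One Two Three Four"))
```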



CURRENCY1: ("Twenty three dollars and fifteen cents")
As in INTEGER1, break up variable utterance into individual
words, and set a TEMP buffer = 0;
FOR each individual word (reading left to right):
  if (individual word = "one"), increase TEMP by 1;
  if (individual word = "two"), increase TEMP by 2;
  . . .
  if (individual word = "twenty"), increase TEMP by 20;
  if (individual word = "thirty"), increase TEMP by 30;
  . . .
  if (individual word = "ninety"), increase TEMP by 90;
  if (individual word = "hundred")
    if (TEMP > 1000)
      TEMP = (TEMP/100) + least significant digit of TEMP;
    end if;
    multiply TEMP by 100;
  end if;
  if (individual word = "thousand"), multiply TEMP by 1000;
  if (individual word = "and"), ignore;
  if (individual word = "dollars"), multiply TEMP by 100;
  if (individual word = "cents"), ignore;
END FOR;
Return the number TEMP/100;
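The CURRENCY1 steps can be sketched in Python as below. As with INTEGER1, the full number-word table is an assumption, since the pseudocode elides most entries; returning a Decimal rather than a binary float is a design choice made for this example, and currency1 is a hypothetical name.

```python
from decimal import Decimal

# Number words; the entries the pseudocode elides are an assumption.
WORD_VALUES = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}


def currency1(phrase):
    """Convert e.g. "twenty three dollars and fifteen cents" to Decimal dollars."""
    temp = 0  # running total, held in cents once "dollars" has been seen
    for word in phrase.lower().split():
        if word in WORD_VALUES:
            temp += WORD_VALUES[word]
        elif word == "hundred":
            if temp > 1000:
                temp = temp // 100 + temp % 10
            temp *= 100
        elif word == "thousand":
            temp *= 1000
        elif word == "dollars":
            # Scale the dollar amount to cents so the cents can be added on.
            temp *= 100
        # "and", "cents" and any unrecognized word are ignored
    return Decimal(temp) / 100


if __name__ == "__main__":
    print(currency1("Twenty three dollars and fifteen cents"))
```

Because the total is kept as a whole number of cents until the final division, the cents words simply add to the scaled dollar amount.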

CURRENCY2: ("two three dollars and one five cents")
As in INTEGER1, break up variable utterance into individual
words, and set a TEMP buffer = 0;
FOR each individual word (reading left to right):
  if (individual word = "dollars"), multiply TEMP by 100;
  else, multiply TEMP by 10;
  if (individual word = "one"), increase TEMP by 1;
  . . .
  if (individual word = "nine"), increase TEMP by 9;
  if (individual word = "cents"), divide TEMP by 10;
  if (individual word = "and"), divide TEMP by 10;
END FOR;
Return TEMP/100 as number in dollars;

TIME: ("one o'clock p.m.", "thirteen hundred hours")
As in INTEGER1, break up variable utterance into individual
words, and set buffers HOUR, MIN = 0;
Discard ("o'clock", "minutes", "hours")
if (first word = "one"), HOUR = 1;
. . .
if (first word = "twenty")
  HOUR = 20;
  if (second word = "one"), increase HOUR by 1;
  if (second word = "two"), increase HOUR by 2;
  if (second word = "three"), increase HOUR by 3;
  next word = third word;
else, next word = second word;
FOR (each individual word from next word on)
  if (individual word = "one"), increase MIN by 1;
  . . .
  if (individual word = "twenty"), increase MIN by 20;
  if (individual word = "thirty"), increase MIN by 30;
  if (individual word = "forty"), increase MIN by 40;
  if (individual word = "fifty"), increase MIN by 50;
  if ((individual word = "p.m.") and (HOUR < 12))
    increase HOUR by 12;
  end if;
end FOR
if (TIME2), return "HOUR:MIN";
if (TIME1)
  if (HOUR > 12)
    decrease HOUR by 12;
    return "HOUR:MIN PM";
  else
    return "HOUR:MIN AM";
  end if;
end if;

DATE1: ("March first nineteen ninety seven", "the first
of March")
As in INTEGER1, break up variable utterance into individual
words, and set buffers MONTH, DAY, YEAR, UNKNOWN = 0 and set
flag DONE = N;
FOR each word
  if (word = "january"), MONTH = 1;
  . . .
  if (word = "december"), MONTH = 12;
  if (word = "first", . . . "twentieth", or "thirtieth")
    if (word = "first"), DAY = 1;
    . . .
    if (word = "twentieth"), DAY = 20;
    if (word = "thirtieth"), DAY = 30;
    if (UNKNOWN is not 0) (i.e., was "twenty" or "thirty")
      Add UNKNOWN to DAY;
      reset UNKNOWN to 0;
    end if;
  else if (word = "oh", "one", . . . "nineteen")
    if (word = "oh") AND (UNKNOWN is not 0)
      Add (UNKNOWN*100) to YEAR;
      UNKNOWN = 0;
      go to next word;
    else
      if (YEAR is not 0)
        Add (value of word) to YEAR;
        go to next word;
      else
        if (UNKNOWN is not 0) AND ((value of word) < 10)
          Add (value of word) and UNKNOWN to YEAR;
          UNKNOWN = 0;
        else if (UNKNOWN is not 0)
          Add (100*UNKNOWN) and (value of word) to YEAR;
          UNKNOWN = 0;
        else
          UNKNOWN = (value of word);
          go to next word;
        end else;
      end else;
    end else;
  else if (word = "twenty" or "thirty")
    if (UNKNOWN is not 0)
      Add (100*UNKNOWN) and (value of word) to YEAR;
      UNKNOWN = 0;
      go to next word;
    else
      UNKNOWN = (value of word);
      go to next word;
    end else;
  else if (word = "forty", "fifty", . . . "ninety")
    if (UNKNOWN is not 0)
      YEAR = 100*UNKNOWN;
      UNKNOWN = 0;
    end if;
    if (YEAR is not 0)
      Add (value of word) to YEAR;
      go to next word;
    else
      UNKNOWN = (value of word);
      go to next word;
    end else;
  else if (word = "hundred")
    if (UNKNOWN is not 0)
      Add (100*UNKNOWN) to YEAR;
      UNKNOWN = 0;
    end if;
    go to next word;
  else if (word = "thousand")
    if (UNKNOWN is not 0)
      Add (1000*UNKNOWN) to YEAR;
      UNKNOWN = 0;
    end if;
    go to next word;
  end else if;
  if (UNKNOWN is not 0), Add UNKNOWN to YEAR;
  Return MONTH, DAY and YEAR in whatever format is
  selected;
end FOR;
END DATE;

The basic operation of the runtime interpreter
(124), as seen and used by an interactive voice response
(IVR) system, is shown in figure 4. In the following
description, the specific function names used are those
found in the preferred embodiment. First, the IVR (130)
must be started in step 400. The IVR (130) is a software
system for performing other duties, such as controlling a
bank's automated teller machine, that takes advantage of
the speech understanding capabilities of the present
invention. For example, a bank may develop an IVR (130)
to provide for a talking automated teller machine. In
the preferred embodiment, the IVR (130) is responsible
for managing the speech recognizer (116) during runtime.

One of the first things the IVR (130) will want
to do is initialize the speech recognizer in step 402.
The exact steps necessary for initializing a speech
recognizer will vary depending on which commercial speech
recognizer is used, but the general steps will involve
compiling the vendor-specific ASR grammar (112) that was
generated using the toolkit (104), and loading the
compiled version into some form of local memory
accessible to the speech recognizer (116).

Next, in step 404, the runtime interpreter
(124) will need to be initialized. This is done when the
IVR (130) calls the NL_Init function. This function
essentially receives, as input, a file path and name for
the annotated ASR corpus (122) that will be used for the
current application and stores this file path and name in
memory.

In step 406, the IVR (130) finishes setting up
the runtime interpreter (124) by calling the NL_OpenApp
function. This function accesses the corpus file whose
name and file path were stored by the NL_Init function in
step 404, and loads the corpus into system memory (RAM)
in preparation for being searched. In order to optimize
the search, the contents of the corpus file (the various
valid utterances) are alphabetized when loaded into RAM.
Alphabetizing the valid utterances enhances search
performance because, in the preferred embodiment, a
binary search is used to match an utterance with a
token. A binary search is a common method of searching
through a sorted list to find a target element, and
basically involves progressively halving the range of
list items being searched until the target item is found.
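
The sort-then-search idea can be sketched in a few lines of Python. The corpus contents, the token names and the helper names load_corpus and find_token are assumptions made for illustration; the sketch uses the standard library's bisect_left for the halving step rather than anything taken from the preferred embodiment.

```python
from bisect import bisect_left

# Hypothetical annotated corpus: (valid utterance, token) pairs.
CORPUS = [
    ("what is my balance", "BALANCE"),
    ("i want money", "WITHDRAW"),
    ("i want honey", "HONEY"),
    ("close my account", "CLOSE"),
]


def load_corpus(pairs):
    """Sort the utterances once at load time so they can be binary searched."""
    ordered = sorted(pairs)
    utterances = [utterance for utterance, _ in ordered]
    tokens = [token for _, token in ordered]
    return utterances, tokens


def find_token(utterances, tokens, heard):
    """Return the token for an exact match, or None if the utterance is absent."""
    i = bisect_left(utterances, heard)  # halves the search range each step
    if i < len(utterances) and utterances[i] == heard:
        return tokens[i]
    return None


UTTERANCES, TOKENS = load_corpus(CORPUS)
```

Sorting once when the corpus is loaded means each later lookup needs only about log2(n) comparisons instead of a scan of the whole list.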

During this loading process, the corpus data is
also optimized by 1) flagging corpus items that contain
variables and 2) generating the list (from large to
small) that specifies the order in which corpus items are
processed for the second search. This last bit of
optimization is important because, as the second search
looks for fragments, smaller fragments (fewer words) may
inadvertently match when a larger fragment is more
appropriate. For example, the item
"I want to transfer ... to savings" is smaller than the
item "I want to transfer ... British pounds to savings".
If the spoken utterance is "I want to transfer ten
British pounds to savings" and the smaller item is
processed first, it will incorrectly match ("I want to
transfer ... to savings" is found) and send the remaining
words ("ten British pounds") for processing as a variable
in the first item, when "ten" should actually be
processed as a variable in the second item. It is
important that larger items are processed first when the
second search is conducted, and this ordering is done
when the corpus is initially loaded into RAM.
A separate list of pointers is generated and stored in
memory when the corpus is loaded, and this list
identifies the order (large to small) in which items with
variables should be processed. A list of flagged corpus
items is also stored in memory.
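
The effect of the large-to-small ordering can be shown with a small Python sketch. It rests on assumptions that are not in the specification: each flagged item holds exactly one "..." placeholder, item size is measured as the number of fixed words, and the helper names are hypothetical.

```python
VARIABLE = "..."


def fixed_words(item):
    """Count the fixed (non-variable) words in a corpus item."""
    return sum(1 for word in item.split() if word != VARIABLE)


def build_order(items):
    """Indices of items that contain a variable, ordered large to small."""
    flagged = [i for i, item in enumerate(items) if VARIABLE in item.split()]
    return sorted(flagged, key=lambda i: fixed_words(items[i]), reverse=True)


def match_with_variable(items, order, utterance):
    """Return (item, variable words) for the first item whose fragments match."""
    words = utterance.split()
    for i in order:
        parts = items[i].split()
        gap = parts.index(VARIABLE)
        head, tail = parts[:gap], parts[gap + 1:]
        if len(words) <= len(head) + len(tail):
            continue  # nothing would be left over to fill the variable
        if words[:len(head)] == head and (not tail or words[-len(tail):] == tail):
            return items[i], words[len(head):len(words) - len(tail)]
    return None


ITEMS = [
    "i want to transfer ... to savings",
    "i want to transfer ... british pounds to savings",
    "what is my balance",
]
ORDER = build_order(ITEMS)  # the larger item is listed first
```

With ORDER, the utterance "i want to transfer ten british pounds to savings" matches the larger item and leaves only "ten" as the variable. Passing the indices in the opposite order reproduces the mismatch described above, where "ten british pounds" is sent for variable processing.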

Once both the speech recognizer (116) and
runtime interpreter (124) have been initialized, and
after the runtime interpreter (124) has loaded the
corpus, the runtime interpreter is ready to do its job.
At this point, the IVR (130) may have other processing to
do, and the runtime interpreter (124) waits.

At some point in the future, the IVR (130) will
detect that a conversation with a speaker is about to
begin. When this happens, the IVR (130) will need to
open a session within the runtime interpreter (124) (a
session is a dialog exchange with the speaker). The IVR
(130) does this by calling the NL_OpenSession function in
step 40^S. This function creates a session handle and
associates the session handle with the session that was
opened. Future function calls relating to this session
will use the session handle to reference the session.

Then, in step 408, the speech recognizer (116)
informs the IVR (130) that a complete utterance may have
been heard. In the preferred embodiment, speech
recognizers (116) are of the type that return data in
NBest form. NBest form is simply an output data format
that includes a list of possible valid utterances (in
text form) heard by the speech recognizer (116) along
with a confidence number indicating the likelihood that
each valid utterance was heard.

The NBest format is helpful when there are
multiple valid utterances that sound alike. For example,
if the valid grammar includes "I want honey" and "I want
money", and the speaker mumbles "I want mf honey", the
speech recognizer will return both valid utterances as
possibilities, rather than simply returning the single
valid utterance that it believed sounded most correct. A
confidence number is also included for each valid
utterance, indicating the speech recognizer's confidence
that that particular valid utterance was indeed the one
it heard. This plurality of possibilities is helpful when
the runtime interpreter (124) also knows the context of
the current discussion and can use the context
information to more accurately determine which valid
utterance was meant. As will be described below, the
preferred embodiment of the runtime interpreter (124)
will use such context information in its determination of
what was meant.

After the IVR (130) receives the output from
the speech recognizer (116), the output is then passed,
in step 410, to the runtime interpreter (124) for
interpretation. To do so, the IVR (130) will call, in
the preferred embodiment, the NL_AnalyzeNBest function.
This function accepts as input the NBest data received by
the IVR (130), a session handle, and a context pointer
indicating the compartment that is to be searched.

When the NL_AnalyzeNBest function is executed,
the runtime interpreter (124) then searches through the
corpus (122) that has been loaded into memory to find the
valid utterance. If a match is found, the return token
is stored in memory. If no match is found, the variables
search discussed above will be performed and the variable
data will be stored in a predefined data structure. This
search is shown in step 412.

After calling NL_AnalyzeNBest, the IVR (130)
will need to call NL_GetResult, in step 416, to retrieve
from memory the token stored by the NL_AnalyzeNBest
function. If the token indicates that a variable was
included in the utterance, then the IVR (130), in step
416, will call NL_GetVariable to retrieve the variable
values from the predefined data structure in memory used
by NL_AnalyzeNBest to store the variable data.
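
How NBest data and a context might combine in such a lookup can be sketched in Python. The specification does not spell out the selection rule at this point, so the sketch assumes one: hypotheses are tried in descending confidence order and the first one that is valid in the given compartment wins. The compartment names, tokens, confidence scale and the function name analyze_nbest are all hypothetical.

```python
# Hypothetical corpus split into compartments (contexts): utterance -> token.
CORPUS = {
    "bank": {"i want money": "WITHDRAW", "what is my balance": "BALANCE"},
    "grocery": {"i want honey": "BUY_HONEY"},
}


def analyze_nbest(nbest, context):
    """Return (token, utterance) for the most confident hypothesis valid in context."""
    compartment = CORPUS[context]
    ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
    for utterance, _confidence in ranked:
        token = compartment.get(utterance)
        if token is not None:
            return token, utterance
    # No exact match: the variable search would be attempted here.
    return None


# The recognizer is unsure whether it heard "honey" or "money".
NBEST = [("i want honey", 0.55), ("i want money", 0.45)]
```

In the "bank" compartment the lower-scored "i want money" is selected because "i want honey" is not a valid utterance there, which is the kind of disambiguation the context information is meant to provide.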

Once the token and any necessary data have been
stored in memory, the runtime interpreter (124) is
finished for the session (for now). In step 413, the
runtime interpreter (124) waits for either another
utterance or an end to the session.






If another utterance occurs, the speech
recognizer (116) will again notify the IVR (130) in step
408, the IVR (130) will call NL_AnalyzeNBest in step 410,
and the process continues as it did before.

If the session is to end, the IVR (130) will
call NL_CloseSession in step 420. Closing the session
disassociates the session handle.



At this point, step 422, the runtime
interpreter (124) waits for either a new session to begin
or for the command to shut down the current application.
If a new session is to begin, the IVR (130) will call
NL_OpenSession again in step 404 and processing continues
from step 404 as before. If the current application is
to be shut down, then the IVR (130) will call NL_CloseApp
in step 424 to release the memory that had been allocated
when the application was opened.

Then, in step 426, the IVR (130) calls
NL_Shutdown to undo the effects of NL_Init.

Finally, in steps 428 and 430, the IVR (130) is
responsible for shutting down the speech recognizer (116)
as well as the IVR (130) itself. The actual steps
necessary will vary depending on the selected speech
recognizer (116) as well as the IVR developer.
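
The order of calls an IVR makes over the life of one session can be summarized with a toy Python model. MockInterpreter is a hypothetical stand-in whose method names merely echo the NL_ functions; the in-memory "files" dictionary replaces a real corpus file, and the lookup is a plain dictionary test rather than the searches described earlier.

```python
class MockInterpreter:
    """Toy stand-in for the runtime interpreter; not the patent's implementation."""

    def __init__(self, files):
        self.files = files        # path -> {utterance: token}, simulating disk
        self.corpus_path = None
        self.corpus = None
        self.sessions = {}        # session handle -> last stored token
        self.next_handle = 1

    def nl_init(self, corpus_path):
        self.corpus_path = corpus_path  # only remember where the corpus is

    def nl_open_app(self):
        self.corpus = dict(self.files[self.corpus_path])  # load into memory

    def nl_open_session(self):
        handle = self.next_handle
        self.next_handle += 1
        self.sessions[handle] = None
        return handle

    def nl_analyze_nbest(self, handle, nbest):
        # Store the token of the first hypothesis found in the corpus, if any.
        self.sessions[handle] = next(
            (self.corpus[u] for u, _ in nbest if u in self.corpus), None)

    def nl_get_result(self, handle):
        return self.sessions[handle]

    def nl_close_session(self, handle):
        del self.sessions[handle]

    def nl_close_app(self):
        if self.sessions:
            raise RuntimeError("close all sessions first")
        self.corpus = None

    def nl_shutdown(self):
        self.corpus_path = None


def run_dialog(interp, corpus_path, heard):
    """Walk one session through the call order described for figure 4."""
    interp.nl_init(corpus_path)
    interp.nl_open_app()
    session = interp.nl_open_session()
    tokens = []
    for nbest in heard:  # one NBest list per utterance
        interp.nl_analyze_nbest(session, nbest)
        tokens.append(interp.nl_get_result(session))
    interp.nl_close_session(session)
    interp.nl_close_app()
    interp.nl_shutdown()
    return tokens
```

Each setup call has a matching teardown call in reverse order: session inside application, application inside the initialized interpreter.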

The runtime interpreter (124) also provides
functionality for the developer who wishes to manage the
Nbest data passed by the CP (126). Functions are
available to create Nbest buffers (NB_CreateBuffer);
create an Nbest buffer with only one utterance
(NB_CreateOneBest); set an utterance in an Nbest buffer
(NB_SetUtterance); set a score for an utterance in an
Nbest buffer (NB_SetScore); set an utterance/score pair
in an Nbest buffer (NB_SetUtteranceScore); determine the
number of utterances that can be stored in the Nbest
buffer (NB_GetNumResponses); get an utterance from an
Nbest buffer (NB_GetUtterance); get a score from an Nbest
buffer (NB_GetScore); and release the memory allocated for
a specified Nbest buffer (NB_DestroyBuffer).
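
A buffer with these operations might look like the following Python sketch. The class NBestBuffer and its method names are hypothetical analogues of the NB_ functions, not their actual signatures, and no counterpart of NB_DestroyBuffer is needed because Python reclaims the memory itself.

```python
class NBestBuffer:
    """Fixed-capacity list of (utterance, score) slots; names are illustrative."""

    def __init__(self, capacity):
        # Analogue of NB_CreateBuffer: every slot starts out empty.
        self.slots = [(None, None)] * capacity

    @classmethod
    def one_best(cls, utterance, score):
        # Analogue of NB_CreateOneBest: a buffer holding a single hypothesis.
        buf = cls(1)
        buf.set_utterance_score(0, utterance, score)
        return buf

    def set_utterance(self, index, utterance):
        self.slots[index] = (utterance, self.slots[index][1])

    def set_score(self, index, score):
        self.slots[index] = (self.slots[index][0], score)

    def set_utterance_score(self, index, utterance, score):
        self.slots[index] = (utterance, score)

    def num_responses(self):
        # Number of utterances the buffer can hold.
        return len(self.slots)

    def get_utterance(self, index):
        return self.slots[index][0]

    def get_score(self, index):
        return self.slots[index][1]
```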

4. The Runtime Interpreter Application Program
Interface

The runtime interpreter application program
interface (128), or RIAPI, is the set of software
functions actually used by the developer of the IVR (130)
to interact with the runtime interpreter (124). The
functions which are included in the preferred embodiment
of the RIAPI (128) include: NL_Init(), NL_OpenApp(),
NL_OpenSession(), NL_AnalyzeNbest(), NL_GetResult(),
NL_GetVariable(), NL_CloseSession(), NL_CloseApp() and
NL_Shutdown().

NL_Init is an initialization function that is
called one time during startup to process initialization
information and allocate memory for sessions.
Initialization information can include a name for a local
log file, the maximum number of sessions and the routing
mode (embedded or distributed; distributed architecture
will be discussed further below). A call to NL_Init, in
the exemplary embodiment, results in a call to CP_Init
(the CP equivalent), which then calls SAI_Init (the
runtime interpreter 124 equivalent). Most of the
following RIAPI (128) functions will also result in
function calls to the CP (126), which then calls the
corresponding runtime interpreter (124) function. Two
exceptions in the preferred embodiment are the
NL_GetVariable and NL_GetResult functions, which directly
access memory to retrieve the variable or result.

NL_OpenApp is called to establish an
application in the interpreter (124). As stated before,
an application is an instance, or implementation, of a
project. Opening an application causes the interpreter
(124) to load the corpus files (122) associated with the
application.

NL_OpenSession is called when a session is
desired under an open application. A session is
essentially a conversation with a speaker, and it is
possible for several sessions to exist for the same
application (if the IVR 130 manages several speech
recognizers, for example).

NL_AnalyzeNbest is called by the IVR (130) when
the speech recognizer has indicated that it has Nbest
output ready. The IVR (130) calls this function to send
this Nbest output, as well as contextual information, to
the runtime interpreter (124) for analysis.

NL_GetResult is called by the IVR (130) to read
the token which was stored in memory by the runtime
interpreter (124).

NL_GetVariable is called when the token stored
by the interpreter (124) is of a type that has variable
data associated with it. The call to NL_GetVariable
retrieves this variable data from a memory data structure
used by the interpreter (124) to store the data.

NL_CloseSession is called to close the
specified session and return any allocated resources that
were associated with the session. Calling this function
may result in the calling of other functions that are
also necessary for closing the session. For example, in
the embedded architecture, NL_CloseSession calls
CP_CloseSession to allow the CP (126) and runtime
interpreter (124) an opportunity to properly close their
respective sessions and return allocated resources that
they no longer need.

NL_CloseApp is called to close the specified
application. This function checks to ensure that all
sessions have been closed, and may also call other
functions such as CP_CloseApp to allow the CP (126) and
interpreter (124) the opportunity to "clean up after
themselves" as well.

NL_Shutdown is called to essentially return the
system to the state that existed before NL_Init was
called. CP_Shutdown may also be called to have the CP
(126) and interpreter (124) deallocate their resources.

In addition to these basic functions, the RIAPI
(128) is also provided with inter/intranet capabilities.
If the natural language system is connected via TCP/IP to
a network, the TcpCallback function can be used to
process asynchronous TCP/IP socket events. The following
RIAPI calls designed to support connections through
Server Interface Process (SIP) to the internet are also
available (although not necessary for non-SIP systems):
NL_WEBConnect (to open a session with a remote web
browser user), NL_ReportWEBText (to pass text responses
to the interpreter 124), NL_WEBPlay (to present or
display file contents to the remote user), NL_WEBListen
(to direct one session to accept input from the SIP
instance connected by NL_WEBConnect), NL_GetWEBResult (to
retrieve results of an NL_WEBListen call) and
NL_CloseWEBSession (to close a session).

As an interface between the IVR (130) and
(ultimately) the runtime interpreter (124), the specific
calls made to the RIAPI (128) will be dictated by the
needs of the IVR (130) for the functionality of the
runtime interpreter (124).

5. Overview of the Distributed Architecture

Thus far, this specification has been
describing the elements of an embedded system
architecture. In an embedded architecture, both the
runtime interpreter (124) and RIAPI (128) are software
elements that reside on the same computer.

In a distributed architecture, a plurality of
distributed runtime interpreters (508) is located among a
plurality of locations within a computer network (in the
preferred embodiment, both Unix and Windows NT networks
are supported). By having this plurality of interpreters
(508), the IVR (130) is able to have a number of
utterances processed simultaneously. The clearest
advantage to this is the ability to operate multiple
sessions at the same time.

Figure 5 shows the elements of a distributed
system architecture. Most of the elements are the same
as the ones found in the embedded architecture. Both the
grammar (112) and corpus (122) are the same as those used
in the embedded architecture. The differences are the
plurality of distributed interpreters (508), the resource
manager (510 - RM), logger (512), operator display (514)
and log viewer (516). The distributed interpreters (508)
and RM (510) are discussed further below.

The logger (512) is simply a software device
that records the various messages that are sent between
the resource manager (510) and various interpreters
(508). Operator display (514) and log viewer (516) are
means by which the developer may monitor the operation of
the IVR (130) and the various interpreters connected to
the system. In the preferred embodiment, the logger
(512), operator display (514) and log viewer (516) do not
allow the user or operator any control over the IVR (130)
application. These devices merely provide information on
the operation of the application.

6. The Distributed Interpreter

In an alternative embodiment of the present
invention, a distributed system is used. The distributed
system operates on a networked computer system. A
networked computer system simply means a plurality of
computers, or nodes, which are interconnected to one
another via a communications network.

In a distributed system, each node that
performs interpreting duties has a DI manager (504), a
DICP (506) and a DI runtime interpreter (508). The DICP
(506) and DI runtime interpreter (508) have the same
functionality as the CP (126) and runtime interpreter
(124) found in the embedded architecture discussed above.
The DI manager (504) is another piece of software that is
responsible for message processing and coordination of
the interpreting duties of the node. Message processing
depends on the type of network used to connect the node
to the resource manager (510). However, the same general
message types are used. The message types and purposes
are discussed below.

The manager (504) itself is a software
component, and before it can process any messages it must
first be executing on the interpreting node. When the
manager (504) is started, it will look in an
initialization file for information regarding the
application supported by the manager (504). This
information includes the name of the application
supported, and the file path to the location of the
annotated corpus (122) to be used for the application
supported.

The <initialize> message causes the DI manager
(504) to initialize the DICP (506) by calling CP_Init,
and the DICP (506) initializes the DI runtime interpreter
(508) by calling SAI_Init. This message also causes the
DI manager (504) to initialize the application to be
supported, by calling CP_OpenApp and SAI_OpenApp to open
the application. As discussed above, opening an
application requires loading the corpus (122). The
location of the corpus (122) to be loaded is passed on to
the DI runtime interpreter (508). When the DI runtime
interpreter (508) completes its initialization (and the
corpus 122 is loaded), it generates an application handle
which is a data object that references the current
application. This handle is returned to the DICP (506),
which in turn passes it back to the DI manager (504).
Whenever an error occurs within the DI (502), the DI
manager (504) composes a <tell error> message describing
the error and returns it to the RM (510).

A session will be opened when the DI manager
(504) receives a <start session> message. This message
includes a resource address which identifies the sending
IVR (130) and a session identifier. The DI manager (504)
checks to make sure there is not already a session opened
with the same resource address, and if there is not,
creates a session object which will represent the
session. A session object is essentially a handle,
similar to the application handle discussed above, that
references this session. The DI manager (504) then opens
the session in the DICP (506) and DI runtime interpreter
(508) by calling the CP_OpenSession function, which calls
the SAI_OpenSession function. The return value of
SAI_OpenSession is passed back to CP_OpenSession, which
returns it to the DI manager (504). Again, errors are
reported by the DI manager (504) with a <tell error>
message.

Once a session has been opened, the DI (502) is
ready to interpret. There are two messages which can
start the process of interpretation. First, the DI
manager (504) could receive an <analyze> message. An
<analyze> message contains all the context and nbest
information normally needed for CP_AnalyzeNbest. The DI
manager (504) then calls the DI runtime interpreter (508)
convenience functions NB_CreateBuffer and
NB_SetUtteranceScore to prepare a structure with the
context and nbest data. The DI manager (504) then
provides this data structure as input to the
CP_AnalyzeNbest function, which calls the
SAI_AnalyzeNbest function which performs the search
described above for the embedded architecture. When
these functions have completed, their return values
propagate back to the DI manager (504), which composes
and sends a <reply> message back to the RM (510).

Receiving the <analyze> message was just one
way the interpretation could be started. The other way
occurs when the context and nbest data are sent in
separate messages. When this occurs, the RM (510) sends
a first message, or <state> message, containing the
context and a resource address identifying the session in
which the utterance was heard. Upon receipt of this
message, the DI manager (504) first confirms that the
resource address is indeed that of an existing session.
If it is, the DI manager (504) retrieves the session
handle associated with the resource address, and stores
the context information from the message in a temporary
memory area to await further processing.

This further processing will occur when the
second message is received by the DI manager (504). The
second message, or <nbest> message, contains a resource
address and some nbest data. When the <nbest> message is
received, the DI manager (504) again checks to make sure
the resource address included in the <nbest> message is
that of an existing session. If so, the DI manager (504)
then looks to the temporary memory area associated with
the session, and finds the previously stored context
information. Taking the nbest data and context data, the
DI manager (504) then makes a call to CP_AnalyzeNbest,
which then calls SAI_AnalyzeNbest, where the corpus (122)
is searched to find the token associated with the
utterance in the nbest data.

A session is ended when the DI manager (504)
receives the <lost call> message. This message includes
a resource address, and the DI manager (504) checks to
make sure that the resource address does indeed reference
an open session. If so, the DI manager (504) calls
CP_CloseSession, which then calls SAI_CloseSession, and
the session is closed much in the same way a session is
closed in the embedded architecture.
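
The session-related messages handled by the DI manager can be sketched as a small dispatcher in Python. The message names come from the text; everything else is an assumption made for illustration, including the class DIManager, the "ok" acknowledgement, the keyword-argument message fields and the dictionary lookup that stands in for the CP and SAI calls.

```python
class DIManager:
    """Toy DI manager; message names follow the text, the rest is assumed."""

    def __init__(self, corpus):
        self.corpus = corpus   # utterance -> token, already loaded
        self.sessions = {}     # resource address -> pending context (or None)

    def handle(self, message, **fields):
        """Dispatch one message and return the reply as a (type, payload) pair."""
        address = fields.get("address")
        if message == "start session":
            if address in self.sessions:
                return ("tell error", "session already open")
            self.sessions[address] = None
            return ("ok", address)
        if address not in self.sessions:
            return ("tell error", "no such session")
        if message == "analyze":    # context and nbest arrive together
            return self._analyze(fields["context"], fields["nbest"])
        if message == "state":      # context arrives first and is held
            self.sessions[address] = fields["context"]
            return ("ok", address)
        if message == "nbest":      # nbest arrives second, context is recalled
            context = self.sessions[address]
            self.sessions[address] = None
            return self._analyze(context, fields["nbest"])
        if message == "lost call":
            del self.sessions[address]
            return ("ok", address)
        return ("tell error", "unknown message")

    def _analyze(self, context, nbest):
        # The context is carried along but not used by this toy lookup.
        for utterance, _score in nbest:
            if utterance in self.corpus:
                return ("reply", self.corpus[utterance])
        return ("reply", None)
```

The <state> and <nbest> pair reaches the same analysis as a single <analyze> message; the only difference is that the manager holds the context between the two arrivals.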

If the entire application is to be shut down,
the DI manager (504) will receive a <terminate> message.
Since each manager (504) can only support one application
at a time, shutting down an application is the same as
shutting down the manager (504). When the DI manager
(504) receives this message, it makes the necessary calls
to CP_CloseSession to close any remaining sessions that
are open, and finally calls CP_Shutdown, which calls
SAI_ShutDown, and all resources allocated to the manager
(504), DICP (506) and DI runtime interpreter (508) are
released.

WO 99/14743    PCT/US98/19433

7. The Resource Manager

The resource manager (510) monitors the
operation of the various distributed interpreters (508)
connected to the network, and distributes RIAPI (128)
requests among the interpreters (508). In the preferred
embodiment, the RM (510) receives a message whenever a
distributed interpreter (508) is initiated and records
the application that is supported by the distributed
interpreter (508). Then, as the resource manager
receives requests from the IVR(s) (130) through the RIAPI
(128), it checks to see which distributed interpreter
(508) can handle the request (supports the application)
and formulates a message containing the IVR (130) request
and sends it to the appropriate manager (504) for
processing. The resource manager (510) communicates with
managers (504) using the messages described above.
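
The record-then-route behaviour described above can be sketched in Python. The class ResourceManager, the interpreter identifiers and the round-robin choice among interpreters that support the same application are assumptions made for illustration; the text only says that a request goes to an interpreter that supports the application.

```python
class ResourceManager:
    """Toy resource manager that routes requests by application name."""

    def __init__(self):
        self.by_app = {}   # application name -> list of interpreter ids
        self.cursor = {}   # application name -> index of the next interpreter

    def register(self, interpreter_id, application):
        """Record which application a newly started interpreter supports."""
        self.by_app.setdefault(application, []).append(interpreter_id)
        self.cursor.setdefault(application, 0)

    def route(self, application, request):
        """Pick an interpreter that supports the application and wrap the request."""
        candidates = self.by_app.get(application)
        if not candidates:
            return None  # no interpreter supports this application
        i = self.cursor[application]
        self.cursor[application] = (i + 1) % len(candidates)  # assumed policy
        return {"to": candidates[i], "request": request}
```

Spreading requests over several interpreters for the same application is what allows multiple sessions to be served at the same time.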

In light of the above teachings, it is
understood that variations are possible without departing
from the scope of the invention embodied in these
teachings. Any examples provided as part of the
inventors' preferred embodiment are presented by way of
example only, and are not intended to limit the scope of
the invention. Rather, the scope of the invention should
be determined using the claims below.

What is claimed is:

Claims

1. A computer system for identifying the meaning behind
a valid spoken utterance in a grammar, said system
comprising:

a central processing unit (CPU);

a system memory coupled to said CPU for
receiving and storing memory files;

a random access memory (RAM) portion of said
system memory, for temporarily receiving and storing data
during operation of said CPU;

an annotated automatic speech recognition (ASR)
corpus file, stored in said system memory, containing a
listing of valid utterances in said grammar and token
data representing the meaning of each listed valid
utterance;

an ASR system coupled to said CPU for detecting
spoken utterances and generating output signals
indicative of detected valid utterances;

a vendor specific ASR grammar file, stored in
said system memory, containing data representing all
valid utterances to be detected by said automatic speech
recognition system; and

runtime interpreter means coupled to said CPU
for identifying, when said CPU receives said speech
recognizer output signals indicative of a detected valid
utterance, the meaning of said detected valid utterance.






2. The system of claim 1, where said runtime
interpreter means further includes means for performing a
comparison search through said annotated ASR corpus in
RAM to find the token data identifying the meaning of the
detected valid utterance.

3. The system of claim 2, where said runtime
interpreter means further includes means for performing a
partial match search through contents of said annotated
ASR corpus upon failure of said comparison search, where
said partial match search seeks a partial match of said
detected valid utterance in said annotated ASR corpus.

4. The system of claim 3, where said runtime
interpreter means further includes variable processing
means for processing the unmatched portion of said
detected valid utterance as a variable to identify the
meaning of said unmatched portion.






5. The system of claim 4, where said variable
processing means generates variable data representing the
meaning of said unmatched portion of said detected valid
utterance.

6. The system of claim 1, further comprising a runtime
interpreter application program interface (RIAPI),
coupled to said runtime interpreter, where said RIAPI is
an interface used to access said runtime interpreter.



7. The system of claim 6, further comprising a custom
processor (CP) interface, coupled to said RIAPI and said
runtime interpreter, which is an interface used by said
RIAPI to access said runtime interpreter.

8. The computer system of claim 1, where said computer
system is a network system and a plurality of said
runtime interpreters is distributed on a plurality of
computers on said computer system network.

9. The computer system of claim 8, further comprising a
resource manager for managing access to said plurality of
runtime interpreters.

10. A speech enabled computer system, comprising:

a CPU;

a system memory coupled to said CPU;

a vendor-specific automatic speech recognition
(ASR) grammar file stored in said system memory;

an automatic speech recognition system, coupled
to said CPU, for detecting valid spoken utterances;

an annotated ASR corpus file, stored in said
system memory, containing a listing of valid utterances
and token data representing the meaning of each listed
valid utterance;

a runtime interpreter, coupled to said
annotated ASR corpus file and said CPU, for searching
through contents of said annotated ASR corpus file to
find token data representing the meaning of said detected
valid spoken utterance;

a custom processor interface (CP), coupled to
said CPU, for accessing said runtime interpreter;

a runtime interpreter application program
interface (RIAPI), coupled to said CP, for accessing said
CP; and

an interactive voice response system (IVR),
coupled to said RIAPI and said ASR system, where said IVR
issues requests to said RIAPI to search contents of said
annotated ASR corpus file for token data representing the
meaning of a valid spoken utterance detected by said ASR
system.

11. A method for identifying the meaning behind a valid
utterance in a grammar, comprising the steps of:

loading an annotated ASR corpus file into
system memory, where said annotated ASR corpus file
contains a listing of valid utterances in said grammar,
as well as token data for each of said listed valid
utterances representing the meaning behind said valid
utterances;

receiving a request to search said annotated
ASR corpus for occurrence of a detected valid utterance;

performing a first search through said loaded
annotated ASR corpus in system memory to find said valid
utterance; and

returning token data corresponding to said
detected valid utterance in said loaded annotated ASR
corpus to sender of said request.

12. The method of claim 11, further comprising the steps
of:

performing a second search through said loaded
annotated ASR corpus upon failure of said first search,
where said second search seeks a partial match of said
detected valid utterance in said annotated ASR corpus;
and

processing the unmatched portion of said
detected valid utterance as a variable and returning
variable data to sender of said request, where said
variable data represents the meaning of said unmatched
portion.



13. The method of claim 11, further comprising the step
of using a runtime interpreter application program
interface (RIAPI) to access said runtime interpreter.

14. The method of claim 13, further comprising the step
of using a custom processor (CP) interface to access said
runtime interpreter.

Figure 4 (flowchart of runtime interpreter operation; step
labels 400 through 430). Boxes, in order: Start IVR;
Initialize Speech Recognizer; Initialize Interpreter; Open
Session; Speech Recognizer Hears Valid Utterance & Passes
Utterance in Text Form to IVR; IVR Passes Utterance (Text)
to Interpreter (via RIAPI, CP); Interpreter Looks Up Text
in Corpus (list + token); Interpreter Passes Token to IVR;
IVR Gets Variable (if needed); Wait for Next Utterance
(branch: Another Utterance); Close Session; Wait to Listen
Again (branch: Listen Again); Close Application; Shutdown
Interpreter; Shutdown Speech Recognizer; Shutdown IVR.